Vectorize `remove_copy` and `unique_copy` by AlexGuteniev · Pull Request #5355 · microsoft/STL

AlexGuteniev · 2025-03-23T14:39:49Z

⚙️ The optimization

remove_copy and unique_copy are different from their non-_copy counterparts in that they don't have room they are allowed to overwrite. This means we can't directly store results from vector registers.

The previous attempt #5062 tried to use masked stores to bypass that limitation. Unfortunately, this doesn't perform well for some CPUs. Also the minimum granularity of AVX2 masked store is 32 bits, so it would not work for smaller elements.

This time, temporary storage comes to rescue. The algorithms already use some additional memory (the tables), so why wouldn't it use a bit more. I arbitrarily picked 512 bytes, should be not too much. Each time the temporary buffer is full, it can be copied to the destination with memcpy, it should be fast enough for this buffer size.

🚫 No `find` before `remove_copy`

In #4987, it was explained that doing find before remove is good for both correctness and performance. Originally it was in vectorization code, but during the review @StephanTLavavej observed that it is done in the headers already (#4987 (comment)).

For remove_copy it is not necessary for correctness, and may be harmful for performance. find would need copy in addition, this will be double pass on the input, which can make the performance worse for large input and memry-bound situation.

We may have special handling of the range before the first match in vectorization code, this is another story, and it would not be harmful, but I'm not doing this in the current PR. Maybe later.

So, as we have not called find, and so have not checked if value type can even match iterator value type, we need this _Could_compare_equal_to_value_type check here.

✅ Test coverage

Shared with non-_copy counterparts to save total tests run time and some lines of code, at the expense with otherwise unnecessary coupling.

We check both modified and unmodified destination parts, to make sure unmodified indeed didn't modify.

⏱️ Benchmark results

Benchmark	Before	After	Speedup
rc<alg_type::std_fn. std::uint8_t>	908 ns	349 ns	2.60
rc<alg_type::std_fn. std::uint16_t>	1850 ns	462 ns	4.00
rc<alg_type::std_fn. std::uint32_t>	901 ns	532 ns	1.69
rc<alg_type::std_fn. std::uint64_t>	1876 ns	1018 ns	1.84
rc<alg_type::rng. std::uint8_t>	1344 ns	349 ns	3.85
rc<alg_type::rng. std::uint16_t>	2094 ns	465 ns	4.50
rc<alg_type::rng. std::uint32_t>	884 ns	460 ns	1.92
rc<alg_type::rng. std::uint64_t>	1884 ns	1079 ns	1.75
uc<alg_type::std_fn. std::uint8_t>	3329 ns	263 ns	12.66 🤡
uc<alg_type::std_fn. std::uint16_t>	1145 ns	342 ns	3.35
uc<alg_type::std_fn. std::uint32_t>	1144 ns	388 ns	2.95
uc<alg_type::std_fn. std::uint64_t>	1128 ns	754 ns	1.50
uc<alg_type::rng. std::uint8_t>	1111 ns	252 ns	4.41
uc<alg_type::rng. std::uint16_t>	1328 ns	331 ns	4.01
uc<alg_type::rng. std::uint32_t>	1313 ns	386 ns	3.40
uc<alg_type::rng. std::uint64_t>	1146 ns	758 ns	1.51

🥇 Results interpretation

Good improvement!

Not as good as for non-_copy counterparts though, as memcpy takes some noticeable time.

The usual codegen gremlins that cause results variation are observed for non-vectorized tight loops. I've marked the most notorious one with clown. I can't explain that anomality.

Less error prone, especially if implementing _copy someday

# Conflicts: # benchmarks/src/unique.cpp # stl/inc/algorithm # stl/src/vector_algorithms.cpp

…ue/unique_copy.

stl/inc/algorithm

benchmarks/src/unique.cpp

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

StephanTLavavej · 2025-04-22T09:08:43Z

Thanks! 😻 I pushed fixes for significant bugs in the "can compare equal to value type" codepaths, and expanded the test coverage (auditing all existing codepaths, and verifying that the new tests caught the bugs). Please double-check.

5950X results:

Benchmark	Before	After	Speedup
`rc<alg_type::std_fn, std::uint8_t>`	1256 ns	373 ns	3.37
`rc<alg_type::std_fn, std::uint16_t>`	1820 ns	440 ns	4.14
`rc<alg_type::std_fn, std::uint32_t>`	1221 ns	630 ns	1.94
`rc<alg_type::std_fn, std::uint64_t>`	1263 ns	1303 ns	0.97
`rc<alg_type::rng, std::uint8_t>`	1866 ns	376 ns	4.96
`rc<alg_type::rng, std::uint16_t>`	1248 ns	440 ns	2.84
`rc<alg_type::rng, std::uint32_t>`	1238 ns	625 ns	1.98
`rc<alg_type::rng, std::uint64_t>`	1251 ns	1338 ns	0.93
`uc<alg_type::std_fn, std::uint8_t>`	1193 ns	407 ns	2.93
`uc<alg_type::std_fn, std::uint16_t>`	1174 ns	554 ns	2.12
`uc<alg_type::std_fn, std::uint32_t>`	1599 ns	586 ns	2.73
`uc<alg_type::std_fn, std::uint64_t>`	1320 ns	1101 ns	1.20
`uc<alg_type::rng, std::uint8_t>`	1213 ns	407 ns	2.98
`uc<alg_type::rng, std::uint16_t>`	1723 ns	439 ns	3.92
`uc<alg_type::rng, std::uint32_t>`	1239 ns	543 ns	2.28
`uc<alg_type::rng, std::uint64_t>`	1242 ns	1101 ns	1.13

AlexGuteniev · 2025-04-22T09:33:18Z

Please double-check.

All looks good.

StephanTLavavej · 2025-04-22T09:49:18Z

/azp run STL-ASan-CI

azure-pipelines · 2025-04-22T09:49:29Z

Azure Pipelines successfully started running 1 pipeline(s).

StephanTLavavej · 2025-04-22T10:16:13Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-04-22T20:56:27Z

I resolved a trivial adjacent-add conflict with #5352 in <algorithm>.

StephanTLavavej · 2025-04-22T21:51:10Z

Thanks for these unique speedups, removing all that execution time! 😹 🚀 🤪

AlexGuteniev and others added 20 commits November 16, 2024 22:48

unique vectorization

cffb1e7

no point

a0b714d

Not unique problem

cccf693

Pointed out coverage

54781db

Deduplicate

fa4ff20

Less error prone, especially if implementing _copy someday

Mention unique shuffling requirement

407897e

whitespace

4ca596b

Merge branch 'main' into unique

54b2938

Include <type_traits> for conditional_t.

a8b1f3b

Direct-init vector instead of calling resize().

144163b

Drop std::.

a2bb838

fix types in pointer test

5ea3d5a

simplify unique pointer test

c53a430

<memory> is no longer used.

b789701

Drop repeated TD alias.

2f51ed2

Value-init ptr_val_array.

8eb10d7

When is_pointer_v<T>, dis(gen) returns int.

2e9f007

Mark _Unique_fallback as noexcept.

e35ed5f

Avoid abbreviated function templates.

8928190

Vectorize remove_copy and unique_copy

7a1de3d

github-project-automation bot added this to STL Code Reviews Mar 23, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 23, 2025

This comment was marked as resolved.

Sign in to view

StephanTLavavej added the performance Must go faster label Mar 23, 2025

StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews Mar 23, 2025

This comment was marked as off-topic.

Sign in to view

AlexGuteniev marked this pull request as ready for review March 25, 2025 05:28

AlexGuteniev requested a review from a team as a code owner March 25, 2025 05:28

Merge remote-tracking branch 'upstream/main' into copycats

d37fde4

# Conflicts: # benchmarks/src/unique.cpp # stl/inc/algorithm # stl/src/vector_algorithms.cpp

AlexGuteniev force-pushed the copycats branch from 5f31ec5 to d37fde4 Compare March 25, 2025 05:53

StephanTLavavej added 11 commits April 21, 2025 15:36

Comment grammar.

1d9e0bc

Move iterators when returning from _Remove_copy_unchecked.

66a7511

Add const.

43006ad

Add newline.

d017014

Directly construct vectors from size.

e33242e

LKG_remove_copy should take InIt, OutIt.

443e7d4

test_case_unique_copy should be called with out_actual.

98868f5

Update reserve calls.

fff4056

Test the rest of the return values of ranges::remove/remove_copy/uniq…

51d9388

…ue/unique_copy.

Add test coverage to find the bugs.

640bb13

Fix bugs.

62401d4

StephanTLavavej reviewed Apr 22, 2025

View reviewed changes

StephanTLavavej approved these changes Apr 22, 2025

View reviewed changes

StephanTLavavej removed their assignment Apr 22, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 22, 2025

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025

Merge branch 'main' into copycats

fbf0e0c

StephanTLavavej merged commit 38d0c04 into microsoft:main Apr 22, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025

AlexGuteniev deleted the copycats branch April 22, 2025 21:51

StephanTLavavej mentioned this pull request May 3, 2025

Test coverage for search_n with impossible value #5440

Merged

AlexGuteniev mentioned this pull request May 5, 2025

<algorithm>: use _Verify_ranges_do_not_overlap function in more places #5472

Closed

AlexGuteniev mentioned this pull request May 13, 2025

Vectorize rotate better #5502

Merged

Conversation

AlexGuteniev commented Mar 23, 2025

⚙️ The optimization

🚫 No find before remove_copy

✅ Test coverage

⏱️ Benchmark results

🥇 Results interpretation

Uh oh!

This comment was marked as resolved.

This comment was marked as off-topic.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlexGuteniev commented Apr 22, 2025

Uh oh!

StephanTLavavej commented Apr 22, 2025

Uh oh!

azure-pipelines bot commented Apr 22, 2025

Uh oh!

StephanTLavavej commented Apr 22, 2025

Uh oh!

StephanTLavavej commented Apr 22, 2025

Uh oh!

Uh oh!

StephanTLavavej commented Apr 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🚫 No `find` before `remove_copy`

StephanTLavavej commented Apr 22, 2025 •

edited

Loading